Which influence has been stronger: the influence of AI
on the physiology of stereopsis, or the reverse?

Greg Detre

Thursday, 22 November, 2001

Prof. Andrew Parker

 

In forming a three-dimensional, object-centred view of the world from our two-dimensional retinal images, we rely on both monocular and binocular cues. Unlike those of some animals (e.g. the rabbit), our eyes both face in the same direction. This restricts the width of our visual field, but it allows us to use tiny disparities between the two retinal images to tell us how far away the objects in front of us are. Stereopsis is this process of binocular comparison by which we perceive depth. Although our eyes are fairly close together, retinal disparities are significant enough to provide information about depth up to somewhere between 30 and 100[1] metres. Stereopsis is most accurate for nearby objects (less than one metre away), which probably relates to its immediate value in tool use.
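
As a rough check on where a figure in this range comes from, the geometry can be sketched numerically: an object at distance D, seen against a much more distant background, subtends a binocular disparity of roughly a/D radians, where a is the separation between the eyes. The short Python sketch below assumes an interocular separation of about 6.5 cm and a few candidate disparity thresholds of a few minutes of arc; both are illustrative assumptions rather than figures taken from the cited source.

    import math

    # Rough geometry of binocular disparity: an object at distance D, seen against a
    # far background, subtends a disparity of about a / D radians (a = interocular
    # separation). Stereopsis stops being useful once this falls below threshold.

    INTEROCULAR_SEPARATION = 0.065          # metres (assumed typical adult value)
    ARCMIN = math.pi / (180 * 60)           # one minute of arc, in radians

    for threshold_arcmin in (1, 2, 5):      # assumed everyday disparity thresholds
        threshold = threshold_arcmin * ARCMIN
        max_distance = INTEROCULAR_SEPARATION / threshold
        print(f"threshold {threshold_arcmin} arcmin -> useful range ~{max_distance:.0f} m")

On these assumptions the useful range comes out at roughly 220, 110 and 45 metres for thresholds of one, two and five minutes of arc respectively, which is broadly consistent with the figure above.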

In answering the question, we have to stipulate whether we're talking about progress in understanding and solving the general engineering problem of stereopsis, or in understanding and replicating how the brain does it. After all, any robot project that interacts with the real world is likely to have two eyes and to want to use the retinal disparities arising from their different positions and viewpoints to garner depth information about the world. In such a project, there is no requirement that the solution to the stereopsis problem bear any resemblance to the one implemented in our brains. Such a system might be non-connectionist, or its eyes might not be horizontally level; either way, if biological plausibility is not a requirement, then various solutions have already been devised. However, if one is trying to understand how the brain solves the stereopsis problem (perhaps ultimately with the aim of improving upon its solution, perhaps just for its own sake), then the constraints and techniques will be very different. On this latter reading of the title, conformity with empirical evidence becomes the most important criterion in judging possible explanations.

Probably the best place to start in any discussion of AI and physiology relating to vision is with David Marr. He is credited with almost single-handedly establishing the field of computational neuroscience, putting forward a set of systems-level hypotheses (of which his theory of the cerebellum is still widely used) and making enormous leaps in vision research with a number of experimentally verifiable, quantifiable and plausible ideas about the workings of the visual system framed within his three-level approach to neuroscience research.

I will start by briefly outlining this three-level conceptual scheme of information-processing, then consider the proposals he makes for how stereopsis is implemented in the brain.

  1. computational theory: an abstract formulation of what is being computed (the function relating inputs to outputs) and why (e.g. in terms of adaptive value)
  2. algorithm: how the computation is carried out, i.e. the particular algorithm the system uses and how it encodes (represents) the information (especially what it makes explicit, and what it leaves implicit)
  3. implementation: what "hardware" (whether silicon-based or biological) is used to implement the algorithm

In terms of the visual system, its purpose is to allow us to "see", i.e. "building a description of the shapes and positions of things from images"[2]. He set out four more controversial principles that structured his particular approach to vision research:

  1. Visual analysis should be characterised as bottom-up wherever possible, i.e. requiring as little high-level knowledge about the scene as possible (e.g. that we are in a wide open space, or that we are underwater)
  2. The visual system does make general assumptions about the way the world is, based on structural regularities in our environment, e.g. that most of the visual field is made up of smooth surfaces, that the light usually comes from above.
  3. Iterative algorithms are too slow, as a rule, to use in visual processing
  4. He sought an independent justification of a computational theory or algorithm, besides just psychological or neurophysiological data

Marr described a set of increasingly abstract stages of visual processing, culminating in a three-dimensional representation encoding what objects are present and where. He took the grey-level description as input (he ignored colour as being supplementary in most of his explanations), a retinal two-dimensional array (which I will term "pixels") containing intensity information only. From this is derived the raw primal sketch, which symbolically encodes the location and orientation of edge segments, blobs and bars in the image. The full primal sketch groups these into boundaries. The 2½-D sketch is a short-term memory store that contains the orientation and approximate distance of surfaces from the viewer. The 3-D model provides an object-centred, three-dimensional representation of the objects, which can then be combined with top-down information to identify them.
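
Read as a processing pipeline, these stages can be sketched as a sequence of increasingly abstract data structures. The Python sketch below is only a loose illustration of that progression; the field names and types are my own assumptions rather than Marr's notation.

    from dataclasses import dataclass
    from typing import List

    GreyLevelImage = List[List[float]]   # 2-D array of raw intensities ("pixels")

    @dataclass
    class PrimitiveToken:                # raw primal sketch element
        kind: str                        # "edge", "blob" or "bar"
        x: float
        y: float
        orientation: float               # radians

    @dataclass
    class SurfacePatch:                  # 2.5-D sketch element (viewer-centred)
        x: float
        y: float
        depth: float                     # approximate distance from the viewer
        orientation: tuple               # local surface orientation

    @dataclass
    class ObjectModel:                   # 3-D model (object-centred)
        label: str
        parts: list                      # hierarchy of volumetric primitives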

The first stage of stereopsis is the correspondence problem. This is the problem of deciding which point on one retina maps to which point on the other retina. Julesz's (1971) random dot stereograms show that this process does not require monocular object recognition. A random dot stereogram (RDS) consists of a pair of 2-D arrays of (usually black and white) randomised pixels, each fed to one eye. A section of one of the arrays has been slightly shifted to the right or left, giving rise to a binocular disparity that the brain interprets as a change in depth: that section of the image convincingly appears to be on a different plane to the viewer. Since there are no objects to be recognised in these images, stereopsis must be possible without any top-down information at all (although there is still some dispute over whether top-down information is sometimes utilised when available).
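
A random dot stereogram is simple to construct, which is part of what makes it such a clean stimulus. The following is a minimal NumPy sketch of the general recipe (the image size, patch size and shift are arbitrary assumed values, and the function name is my own), not Julesz's specific procedure.

    import numpy as np

    def make_random_dot_stereogram(size=200, patch=80, shift=6, seed=None):
        """Return a left/right image pair of random black/white dots in which a
        central square patch is shifted horizontally in the right image, creating
        a disparity with no monocularly visible object."""
        rng = np.random.default_rng(seed)
        left = rng.integers(0, 2, size=(size, size))
        right = left.copy()
        top = (size - patch) // 2
        # paste the central patch into the right image, shifted left by `shift` pixels
        right[top:top + patch, top - shift:top - shift + patch] = left[top:top + patch, top:top + patch]
        # the strip uncovered by the shift is refilled with fresh, uncorrelated dots
        right[top:top + patch, top + patch - shift:top + patch] = rng.integers(0, 2, size=(patch, shift))
        return left, right

Viewed so that each image reaches one eye, the shifted patch appears to float at a different depth, even though neither image on its own contains any visible shape.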

The correspondence problem is a problem because of "false targets". For any pixel of a given intensity, there are likely to be many corresponding pixels of the same intensity with which it could be matched. The algorithm has to find a global optimum (matching the maximum possible number of pixels in the whole image) without attempting to match every pixel against every other possible pixel. We are seeking an algorithm that probably works locally within the visual field (since retinal disparity only provides useful information over a restricted range of distances), is ideally non-iterative, and will find the optimal (i.e. correct) solution extremely reliably.

Marr and Poggio's stereopsis theory uses three general principles to help with the correspondence problem. They are easiest to understand if we imagine that the visual field is taken up with the image of a sphere with black spots on its surface.

  1. compatibility: black pixels can only be matched with black pixels, and white with white
  2. uniqueness: any pixel in one image can be matched with only one pixel in the other image
  3. continuity: distance from the observer should vary smoothly almost everywhere (except at the boundaries of surfaces)

Marr and Poggio's[3] first "co-operative" algorithm used inhibitory connections along the line of sight to ensure uniqueness (since there cannot be visible surface features at different depths along the same line of sight), and excitatory connections between contiguous pixels to ensure continuity. Since it relies on an iterative constraint-satisfaction procedure, and because it did not tally with psychophysical evidence, Marr discarded this early algorithm.
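
A heavily simplified sketch of how such a co-operative scheme can be organised is given below, restricted to a single scan line. Cell (x, d) stands for the hypothesis that pixel x in the left image matches pixel x + d in the right image; excitation between neighbouring cells at the same disparity implements continuity, and inhibition between rival disparities at the same pixel implements uniqueness. The parameter values, neighbourhood size and the restriction of inhibition to one eye's line of sight are all simplifying assumptions of mine, not the published algorithm.

    import numpy as np

    def cooperative_stereo_1d(left_row, right_row, max_disp=5, iters=8,
                              epsilon=2.0, theta=3.0):
        """Toy 1-D cooperative matching on a single scan line (illustrative only)."""
        n = len(left_row)
        disps = list(range(-max_disp, max_disp + 1))
        # compatibility: a match hypothesis starts active only if the two pixels agree
        init = np.zeros((n, len(disps)))
        for di, d in enumerate(disps):
            for x in range(n):
                if 0 <= x + d < n and left_row[x] == right_row[x + d]:
                    init[x, di] = 1.0
        state = init.copy()
        for _ in range(iters):
            new = np.zeros_like(state)
            for di in range(len(disps)):
                for x in range(n):
                    # continuity: support from nearby positions at the same disparity
                    excite = state[max(0, x - 3):x + 4, di].sum() - state[x, di]
                    # uniqueness: inhibition from rival disparities at the same left pixel
                    inhibit = state[x, :].sum() - state[x, di]
                    new[x, di] = 1.0 if excite - epsilon * inhibit + init[x, di] >= theta else 0.0
            state = new
        return state   # surviving 1s mark the accepted matches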

Marr and Poggio's[4] second algorithm looks for matches between zero-crossings (of second-order derivatives) at four different levels of blurriness, using del-squared Gaussian (Laplacian of Gaussian) filters. The most blurred images highlight coarse features, and can match disparities over fairly large areas (as is necessary when objects' depths could vary over a wide range of distances). Having roughly established the depth of the features, the same process can be carried out on a less blurred channel, using the first results as a guide, and so on for all four channels. The information from each channel guides vergence movements, so that "the range of disparities being processed by the next narrowest channel is always centred around zero"[5].
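
In outline, each channel filters the image with a del-squared Gaussian of a given width, marks the zero-crossings of the output, and matches those zero-crossings between the two eyes, with the coarsest channel processed first and its result guiding the finer ones. The sketch below (using SciPy's Laplacian-of-Gaussian filter and arbitrary, assumed channel widths) illustrates only the filtering and zero-crossing step, not the matching or vergence control.

    import numpy as np
    from scipy.ndimage import gaussian_laplace

    def zero_crossings(image, sigma):
        """Crude zero-crossing map of the del-squared-G filtered image at one scale:
        a sign change between horizontal neighbours counts as a crossing."""
        filtered = gaussian_laplace(np.asarray(image, dtype=float), sigma)
        return (np.sign(filtered[:, :-1]) * np.sign(filtered[:, 1:])) < 0

    def multiscale_zero_crossings(image, sigmas=(8.0, 4.0, 2.0, 1.0)):
        """Four channels, coarsest (most blurred) first; the sigma values are
        illustrative choices, not the channel widths used in the paper."""
        return {sigma: zero_crossings(image, sigma) for sigma in sigmas}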

These ideas have since been developed in a number of directions. Mayhew and Frisby[6] proposed a third algorithm, with a binocular raw primal sketch. Their solution to the correspondence problem emphasised figural continuity and the relation between the outputs of the del-squared G filters that are used to identify edge segments and bar segments. Similarly, Prazdny replaced Marr's continuity assumption with a coherence principle, which required only that the disparities of neighbouring elements belonging to the same 3-D object be similar, so that locally similar disparities should facilitate each other, while more distant and dissimilar disparities should not interact.
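
In the spirit of Prazdny's coherence principle (though this is an illustrative formula of my own, not his published one), the support that one matched element lends to another might be written so that the tolerated disparity difference grows with the separation between the elements: nearby elements with similar disparities reinforce each other strongly, while distant or very different disparities contribute almost nothing.

    import math

    def coherence_support(disparity_i, disparity_j, separation, k=0.5):
        """Illustrative coherence-style support weight between two matched elements:
        a Gaussian in the disparity difference whose spread grows with the spatial
        separation of the elements (k is an arbitrary scaling constant)."""
        sigma = k * max(separation, 1e-6)
        diff = disparity_i - disparity_j
        return math.exp(-diff * diff / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))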

 

Neurophysiological approaches to stereopsis tend to be structured in terms of the anatomical area(s) being investigated. It seems clear that binocular information is not brought together earlier than V1 (see Xue et al, 1987 for a discussion of the lack of disparity-selective cells in the LGN), so neurophysiologists can focus on striate and extrastriate cortical areas.

One of the main problems facing researchers into the physiology of stereopsis is being sure that neuronal activity that seems to be correlated with psychophysical reports is actually involved in stereopsis processing. After V1, there are numerous areas which combine binocular inputs and can be broadly described as disparity-selective. Parker & Newsome (1988) stipulate three conditions for regarding a neuron as really contributing to the performance of a given stereoscopic task:

  1. neuronal activity should be recorded during performance of the task, and it should be shown that the candidate neurons are sufficiently sensitive to mediate task performance
  2. neuronal activity should be shown to covary with perceptual judgements near psychophysical threshold
  3. artificial manipulation of neuronal activity (activation or suppression) should alter performance of the task

These criteria take into account the second difficulty of stereopsis experiments, that of isolating binocular from monocular effects. After all, whenever the visual signals are altered to test a stereoptic hypothesis, the input to one or both eyes necessarily changes too, and we need to be sure that we are not mistaking neuronal activity related to this monocular change for stereopsis-related processing. Random dot stereograms provide one means of distinguishing binocular from monocular effects, since any resulting disparity selective activity must relate to binocular correlation. Using a dichoptic bar stimulus in all possible combinations of positions in the two eyes (Ohzawa et al 1990) also differentiates between monocular and binocular effects.

V1 is the first area in the visual system where signals from both eyes are integrated. However, single V1 neurons do not seem to be able to account for the conscious perception of stereopsis (Cumming and Parker 1997), even though they definitely do signal the disparity of a stimulus, as shown initially with sweeping bar stimuli in anaesthetized cats (Pettigrew et al 1968, Barlow et al 1967). In order to establish how complex a role V1 neurons play in stereopsis processing, anticorrelated RDSs (like a normal RDS, but with the contrast inverted in one of the images) are used, because "the pattern of local matches they produce appears to be rivalrous with no consistent depth, except at low dot densities (<5%)". The disparity "energy" model proposes that a complex cell's response is built by summing the squared outputs of binocular simple cells activated by a disparity. As predicted by this model, aRDSs give rise to an inverted disparity tuning response (Cumming and Parker 1997), indicating that V1 performs only a preliminary role in depth processing.
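
A minimal one-dimensional sketch of the energy model is given below: each binocular "simple" subunit sums a Gabor-weighted input from each eye, with a position shift between the two receptive fields encoding the preferred disparity, and the "complex" (energy) response is the sum of squares over a quadrature pair of such subunits. The filter parameters, and the choice of a position shift rather than a phase shift, are illustrative assumptions.

    import numpy as np

    def gabor(x, sigma, freq, phase):
        """1-D Gabor receptive-field profile."""
        return np.exp(-x**2 / (2 * sigma**2)) * np.cos(2 * np.pi * freq * x + phase)

    def disparity_energy(left_patch, right_patch, positions, sigma=4.0, freq=0.1, dx=0.0):
        """Energy response of a model complex cell whose preferred disparity is dx.
        left_patch and right_patch are 1-D luminance profiles sampled at `positions`."""
        energy = 0.0
        for phase in (0.0, np.pi / 2):                            # quadrature pair
            rf_left = gabor(positions, sigma, freq, phase)
            rf_right = gabor(positions - dx, sigma, freq, phase)  # position-shifted right RF
            simple = np.dot(rf_left, left_patch) + np.dot(rf_right, right_patch)
            energy += simple ** 2
        return energy

Expanding the squares shows why anticorrelated stereograms invert the tuning: the energy response contains a binocular cross-term proportional to the product of the left and right filter outputs, and reversing the contrast of one image flips the sign of that term, turning peaks of the disparity tuning curve into troughs.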

Disparity selectivity has been noted in monkeys in various extrastriate areas, including V2, V3, V3A, VP, MT, MST, IT and some visuomotor regions of the parietal and frontal cortex. Evidence from humans implicates all of these areas to a greater or lesser degree (e.g. Hubel and Livingstone (1987); DeAngelis and Newsome (1999); Uka et al. (2000)), including a notably ordered disparity representation in MT (DeAngelis, Cumming and Newsome (1998)). Vaina's broad overview of the neurological evidence indicates that occipito-parietal areas are important for establishing binocular correspondence, but that occipito-temporal areas are needed to extract three-dimensional shape.

 

There are always certain major problems with comparing high-level GOFAI ("good old-fashioned AI"; see Haugeland 1981; Boden 1990) solutions with neurophysiological ones, arising out of a combination of ethical considerations, complexity and the standard of our measuring and observation techniques. The foremost problem is that connectionist solutions are distributed: functional organisation often corresponds broadly with spatial organisation, but this is usually contingent on the nature of the neural developmental process. This makes it very difficult for human observers working with limited data to discern patterns. It can be very difficult to see the broader picture when working with single cells, while on the other hand, inferring low-level functionality from gross imaging data can be highly speculative. Relatedly, even with perfect data, we find it very difficult to interpret synaptic organisation in a way that we can understand (i.e. algorithmically), and so progress in neurophysiology is very often driven by speculative, hypothesised algorithmic solutions to the general problem (e.g. Marr and Poggio's algorithms, or the disparity energy model), which are then verified or falsified through careful experimentation.

Of course, AI and physiology meet in the middle, and can both be seen as aspects of computational neuroscience. They can be defined either in terms of the techniques they employ, or alternatively in terms of how much they emphasise the biological plausibility of their solutions. When the study of a cognitive function is in its earliest stages, neurophysiology serves mainly to add a provisional gloss to neuroanatomy, while AI researchers are able to look ahead to consider what the computational meat behind the phenomenon might be. Certainly this was the case two decades ago; we now have enough of an understanding of the problem, and a sufficient variety of potential solutions, that neurophysiology has taken over in ascertaining which of these is actually implemented. Indeed, long after having laid the initial groundwork, during mature research AI usually follows empirical research, since history has repeatedly shown us that Mother Nature's solutions tend to be elegant and powerful, and progress in coming up with a better, artificial solution usually benefits hugely from having a biologically plausible basis.

 

 



[1] Kandel and Schwartz, Principles of Neural Science, 3rd edition, ch 30, pg 454

[2] David Marr, Vision (1982), pg 36

[3] Marr and Poggio (1976), "Co-operative computation of stereo disparity", Science 194: 283-7

[4] Marr and Poggio (1979), "A computational theory of human stereo vision", Proceedings of the Royal Society of London Series B, 204: 522-3

[5] Garnham, Artificial Intelligence (1988)

[6] Mayhew and Frisby (1981), "Psychophysical and computational studies towards a theory of human stereopsis", Artificial Intelligence, 17: 349-85